{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Isi Penting Generator article style"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate a long text with article style given isi penting (important facts)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/isi-penting-generator-article-style](https://github.com/huseinzol05/Malaya/tree/master/example/isi-penting-generator-article-style).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "Results generated using stochastic methods.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.66 s, sys: 4.06 s, total: 6.73 s\n",
      "Wall time: 1.96 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya\n",
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available HuggingFace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mesolitica/finetune-isi-penting-generator-t5-small-standard-bahasa-cased': {'Size (MB)': 242,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024},\n",
       " 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased': {'Size (MB)': 892,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024}}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.generator.isi_penting.available_huggingface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load HuggingFace\n",
    "\n",
    "Transformer Generator in Malaya is quite unique, most of the text generative model we found on the internet like GPT2 or Markov, simply just continue prefix input from user, but not for Transformer Generator. We want to generate an article or karangan like high school when the users give 'isi penting'.\n",
    "\n",
    "```python\n",
    "def huggingface(\n",
    "    model: str = 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased',\n",
    "    force_check: bool = True,\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load HuggingFace model to generate text based on isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')\n",
    "        Check available models at `malaya.generator.isi_penting.available_huggingface`.\n",
    "    force_check: bool, optional (default=True)\n",
    "        Force check model one of malaya model.\n",
    "        Set to False if you have your own huggingface model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.torch_model.huggingface.IsiPentingGenerator\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a5f8a10a9f0e45f18dad98726eca0a73",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b99c9d303cd84bc69e4ae0716c4c3901",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading spiece.model:   0%|          | 0.00/803k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "553f9ff9c454401284563eba2ac16ef1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n",
      "You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9b40e67d35ee43c1b4923e1efc793a7c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)lve/main/config.json:   0%|          | 0.00/822 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "548c953221dd433e9c42234ac1af292a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model = malaya.generator.isi_penting.huggingface()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### generate\n",
    "\n",
    "```python\n",
    "def generate(\n",
    "    self,\n",
    "    strings: List[str],\n",
    "    mode: str = 'surat-khabar',\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    generate a long text given a isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings : List[str]\n",
    "    mode: str, optional (default='surat-khabar')\n",
    "        Mode supported. Allowed values:\n",
    "\n",
    "        * ``'surat-khabar'`` - news style writing.\n",
    "        * ``'tajuk-surat-khabar'`` - headline news style writing.\n",
    "        * ``'artikel'`` - article style writing.\n",
    "        * ``'penerangan-produk'`` - product description style writing.\n",
    "        * ``'karangan'`` - karangan sekolah style writing.\n",
    "\n",
    "    **kwargs: vector arguments pass to huggingface `generate` method.\n",
    "        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Good thing about HuggingFace\n",
    "\n",
    "In `generate` method, you can do greedy, beam, sampling, nucleus decoder and so much more, read it at https://huggingface.co/blog/how-to-generate\n",
    "\n",
    "And recently, huggingface released https://huggingface.co/blog/introducing-csearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Neelofa tetap dengan keputusan untuk berkahwin akhir tahun ini',\n",
    "              'Long Tiger sanggup membantu Neelofa',\n",
    "              'Tiba-tiba Long Tiger bergaduh dengan Husein Zolkepli']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also can give any isi penting even does not make any sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Pada tahun 2013, Neelofa menjadi subjek iklan TV untuk jenama A.L.M. #2. '\n",
      " 'Pada bulan Jun 2017, Neelofa dan isteri Husein Zolkepli melancarkan kempen '\n",
      " 'untuk mempromosikan jenama A.L.M., syarikat teknologi maklumat yang '\n",
      " 'berpangkalan di Amerika Syarikat yang pertama. Dengan nama Tik Tok, ia '\n",
      " 'adalah jenama nombor satu di pasaran Amerika Syarikat.']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, do_sample=True, mode = 'artikel',\n",
    "    max_length=256,\n",
    "    top_k=50, \n",
    "    top_p=0.95, ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Neelofa adalah anak lelaki sulung Neelofa yang masih hidup. Dia juga '\n",
      " 'merupakan anak perempuan sulung Neelofa yang masih hidup. Pada tahun 2017, '\n",
      " 'dia berkahwin dengan bekas isterinya, bekas isterinya, Husein Zolkepli. '\n",
      " 'Pasangan ini telah berkahwin selama lebih dari satu tahun.']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'artikel',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Astronomi (dari) adalah sains semula jadi yang mengkaji objek dan fenomena cakerawala.',\n",
    " 'Ia menggunakan matematik, fizik, dan kimia untuk menjelaskan asal usul dan evolusi mereka.',\n",
    " 'Objek yang menarik termasuk planet, bulan, bintang, nebula, galaksi, dan komet.',\n",
    " 'Fenomena yang relevan termasuk letupan supernova, ledakan sinar gamma, kusar, blazar, pulsar, dan radiasi latar belakang gelombang mikro kosmik.',\n",
    " 'Secara umum, astronomi mengkaji semua yang berasal dari luar atmosfer Bumi.',\n",
    " 'Kosmologi adalah cabang astronomi.',\n",
    " 'Ia mengkaji Alam Semesta secara keseluruhan.']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Astronomi adalah sains perubatan yang sangat berkaitan dengan sains alam '\n",
      " 'sekitar. Astronomi adalah kajian mengenai objek, fenomena, dan proses alam '\n",
      " 'sekitar di seluruh alam semesta. Astronomi adalah bidang yang sangat penting '\n",
      " 'di seluruh dunia. Di kebanyakan negara, astronomi berfungsi sebagai cabang '\n",
      " 'sains, di mana sains, undang-undang, dan matematik telah digunakan untuk '\n",
      " 'mengkaji objek tertentu. Pada masa lalu, sains fizikal dikira dengan '\n",
      " 'memasukkan maklumat mengenai alam sekitar sebagai sumber utama maklumat. Kra '\n",
      " 'strawberg dan Petersen menggunakan istilah \"astronomi\" untuk merujuk kepada '\n",
      " 'sains kimia. Ia juga mengkaji sifat fizikal ciptaan sains yang tidak dapat '\n",
      " 'ditentukan. Dalam bidang astronomi, bidang ini biasanya disebut sebagai '\n",
      " '\"astronomi\", dan merupakan bidang yang sangat penting dalam sains (atau '\n",
      " 'fizik) untuk mengkaji alam sekitar. Bidang sains lain termasuk sains suria, '\n",
      " 'fizik, astronomi, dan geofizik. Istilah ini diciptakan oleh ahli astronomi '\n",
      " 'Perancis Pierre Laporte. Dalam bidang fizik, ini termasuk astronomi, '\n",
      " 'geologi, geologi, biologi, astrofizik, dan fizik. Ini termasuk astronomi, '\n",
      " 'fizik, dan sains fizikal, yang merangkumi kajian objek (atau objek) yang '\n",
      " 'mempunyai kesan alam sekitar yang sangat kuat dan mendalam, seperti planet, '\n",
      " 'komet, dan nebula. Beberapa ahli astronomi menggunakan istilah yang berbeza '\n",
      " 'untuk menggambarkan']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, do_sample=True, mode = 'artikel',\n",
    "    max_length=256,\n",
    "    top_k=50, \n",
    "    top_p=0.95, ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Astronomi adalah sains yang mengkaji objek dan fenomena cakerawala. '\n",
      " 'Astronomi mengkaji semua fenomena cakerawala. Astronomi adalah kajian objek '\n",
      " 'dan fenomena yang berasal dari luar atmosfer Bumi, seperti bintang, Bulan '\n",
      " 'dan bintang. Ia mengkaji semua objek yang berasal dari luar atmosfer Bumi, '\n",
      " 'termasuk nebula, bintang, dan komet. Astronomi mengkaji pelbagai fenomena '\n",
      " 'yang berasal dari luar atmosfer Bumi, termasuk letupan supernova, ledakan '\n",
      " 'sinar gamma, kusar, pulsar, dan radiasi latar belakang gelombang mikro '\n",
      " 'kosmik. Astronomi adalah kajian semua objek dan fenomena cakerawala. '\n",
      " 'Astronomi adalah sains yang mengkaji alam semesta secara keseluruhan. '\n",
      " 'Astronomi mengkaji semua yang berasal dari luar atmosfer Bumi, termasuk '\n",
      " 'galaksi, nebula, dan galaksi. Astronomi mengkaji semua fenomena yang berasal '\n",
      " 'dari luar atmosfer Bumi, termasuk pulsar, pulsar, dan radiasi latar belakang '\n",
      " 'gelombang mikro kosmik. Astronomi mengkaji semua objek dan fenomena '\n",
      " 'cakerawala, termasuk bintang, nebula, dan galaksi. Astronomi mengkaji alam '\n",
      " 'semesta secara keseluruhan, termasuk semua objek dan fenomena yang berasal '\n",
      " 'dari luar atmosfer Bumi, termasuk nebula, dan galaksi. Astrono']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'artikel',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
